SPEECH ANALYSIS METHOD AND
SPEECH SYNTHESIS SYSTEM



BACKGROUND OF THE INVENTION 

The present invention relates to a so-called speech analysis-synthesis system, which analyzes a speech waveform to represent it as parameters, compresses and stores the parameters, and then synthesizes the speech using the parameters.

In a speech analysis-synthesis system, also called a "vocoder", speech signals are efficiently represented by a small number of parameters through modeling, and the original speech is then synthesized from the parameters. Such a system allows speech to be transmitted with a far smaller amount of data than when the speech is transmitted as waveform data. For this reason, speech analysis-synthesis systems have been used in speech communication systems. One typical speech analysis-synthesis system is the LPC (linear predictive coding) analysis-synthesis system.

However, speech synthesized by an LPC vocoder or by many other vocoders sounds noticeably unnatural as human speech. The LPC vocoder is a model in which the sound source for voiced sounds is assumed to be an impulse train and the sound source for unvoiced sounds is assumed to be white noise. Thus, voiced regions of the speech have a buzzy sound quality. Also, the waveform of the vocal tract vibration differs from an impulse train, and thus the effects of the spectral tilt or the like of the sound source cannot be correctly taken into account. As a result, estimation errors in the vocal tract transfer characteristics increase.

To address this, a method has been devised for estimating a vocal tract parameter and a voice source parameter simultaneously using a glottal waveform model as the sound source. Ding et al. have developed a pitch-synchronous speech analysis-synthesis method based on an ARX (autoregressive-exogenous) speech production model (Ding, W., Kasuya, H., and Adachi, S., "Simultaneous Estimation of Vocal Tract and Voice Source Parameters Based on an ARX Model", IEICE Trans. Inf. & Syst., Vol. E78-D, No. 6, June 1995). The method, however, has shown deficiencies in the analysis of voices with a short pitch period and of transitional portions between vocalic and consonantal segments.

SUMMARY OF THE INVENTION 

According to an aspect of the present invention, a speech synthesis system, which synthesizes speech using time series data of formant parameters (including a formant frequency and a formant bandwidth) estimated based on a speech production model, includes determining the correspondence of formant parameters between adjacent frames using dynamic programming.

Preferably, in the speech synthesis system, in determining the correspondence of the formant parameters, a connection cost $d_c(F(n), F(n+1))$ and a disconnection cost $d_d(F(k))$ are obtained using the equations:

$$d_c(F(n), F(n+1)) = \alpha\,|F_f(n) - F_f(n+1)| + \beta\,|F_i(n) - F_i(n+1)|$$

$$d_d(F(k)) = \beta\,|F_i(k) - \varepsilon|$$

where $\alpha$ and $\beta$ are predetermined weight coefficients, $F_f(n)$ is a formant frequency in the $n$th frame, $F_i(n)$ is a formant intensity in the $n$th frame and $\varepsilon$ is a predetermined value, and the resultant $d_c(F(n), F(n+1))$ and $d_d(F(k))$ are used as costs for grid point shifting in dynamic programming.

Preferably, in the speech synthesis system, for two adjacent frames in which there exists a formant that has no counterpart to be connected, a formant having the same frequency as that of the disconnected formant in one of the frames and an intensity of 0 is placed in the other frame, and the two adjacent frames are connected by interpolating the frequencies and intensities of both formants according to a smooth function.

Preferably, in the speech synthesis system, the formant intensity $F_i(n)$ is calculated using

$$F_i(n) = \begin{cases} 20\log_{10}\dfrac{1 + e^{-\pi F_b(n)/F_s}}{1 - e^{-\pi F_b(n)/F_s}}, & \text{if formant} \\[2ex] 20\log_{10}\dfrac{1 - e^{-\pi F_b(n)/F_s}}{1 + e^{-\pi F_b(n)/F_s}}, & \text{if anti-formant} \end{cases}$$

where $F_b(n)$ is a formant bandwidth in the $n$th frame and $F_s$ is a sampling frequency.

Preferably, in the speech synthesis system, a vocal tract transfer function including a plurality of formants is implemented by a cascade connection of a plurality of filters, and when a formant which has no counterpart to be connected exists in the adjacent frames and thus the connection of the filters needs to be changed, the coefficients and internally stored data of the filter in question are copied into another filter, and the first filter is then overwritten with the coefficients and internally stored data of still another filter or initialized to predetermined values.

According to another aspect of the present invention, a speech analysis method, in which a sound source parameter and a vocal tract parameter of a speech signal waveform are estimated by using a glottal source model including an RK voicing source model, includes the steps of: extracting an estimated voicing source waveform using a filter constituted by the inverse characteristic of an estimated vocal tract transfer function; estimating a peak position corresponding to a GCI (glottal closure instant) of the estimated voicing source waveform, at a time resolution finer than the sampling period, by fitting a quadratic function; generating a voicing source model waveform whose GCI is synchronized with a sampling position in the vicinity of the estimated peak position; and time-shifting the generated voicing source model waveform, at a time resolution finer than the sampling period, by means of all-pass filters, thereby matching the GCI with the estimated peak position.
According to still another aspect of the present invention, a speech analysis method, in which a voicing source parameter and a vocal tract parameter of a speech signal waveform are estimated by using a glottal voicing source model such as an RK model or a model defined as an extended model thereof, includes the steps of: extracting an estimated voicing source waveform using filters constituted by the inverse characteristic of an estimated vocal tract transfer function; and, taking the first harmonic level as H1 and the second harmonic level as H2 in the DFT (discrete Fourier transform) of the estimated voicing source waveform, estimating an OQ (open quotient) from a value HD defined as HD = H2 - H1.

Preferably, in the speech analysis method, for estimating the OQ, the relation:

$$OQ = 3.65\,HD - 0.273\,HD^2 + 0.0224\,HD^3 + 50.7$$

is used.



BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram illustrating an ARX speech production model.

FIG. 2 is a graph showing a relationship between the OQ 
parameter of an RK model and the difference between the first 
harmonic level and the second harmonic level. 

FIG. 3 is a graph showing an exemplary voicing source pulse waveform when all-pass filters are used, in which (a) indicates an original waveform, (b) indicates a waveform which has been shifted by $T_d$ = 50 μs and (c) indicates another waveform which has been randomized with $d_g$ = 3 ms and then shifted.

FIG. 4A is a graph showing discrete formants; and FIG. 4B 
is a graph showing changes of spectra of the formants. 

FIG. 5 is a graph showing the evaluation results of acoustic experiments.

FIG. 6 is a block diagram illustrating the configuration 
of a speech analysis system according to a first embodiment of 
the present invention. 

FIG. 7 is a chart illustrating the flow of a speech analysis process.

FIG. 8 is an illustration of how the AV parameter is obtained.

FIG. 9 is a graph illustrating the concept of polar coordinates of a complex number.

FIG. 10 is an illustration of how GCIs are estimated with high accuracy.

FIGS. 11A and 11B are illustrations of how an RK model voicing source waveform is shifted using all-pass filters with a resolution finer than the sampling period.

FIG. 12 is a block diagram illustrating the configuration 
of a speech synthesis system according to a third embodiment of 
the present invention. 

FIG. 13 is a block diagram illustrating the configuration 
of an RK model voicing source generation unit in a speech synthesis 
system according to a fourth embodiment of the present invention. 



FIG. 14 is a block diagram illustrating the configuration 
of a speech synthesis system according to a fifth embodiment of 
the present invention. 

FIG. 15 is a chart showing a relationship between formant frequency and bandwidth for two adjacent formants.

FIG. 16 is an illustration of the concept of a grid in which formants in Frame A are laid off as abscissas and formants in Frame B are laid off as ordinates.

FIG. 17 is an illustration of a grid in the case where all the formants are connected with their counterparts having the same number.

FIG. 18 is an illustration of a grid in the case where a disconnected formant exists.

FIG. 19 is an illustration of constraints on a shift. 

FIG. 20 is a chart showing grid points through which a path 
can pass under the constraints of FIG. 19. 

FIG. 21 is a chart illustrating the flow of a path search process.

FIG. 22 is an illustration of exemplary costs which have 
been calculated by a path search process. 

FIG. 23 is a chart showing how Path B has been selected. 

FIG. 24 is a chart showing the obtained optimum path. 

FIG. 25 is a chart showing how a formant has been connected 
according to an optimum path. 

FIG. 26 is a chart in which Frame A and Frame B and their vicinity have been enlarged.

FIG. 27 is a chart showing how a formant with an intensity of 0, intended for another formant which is in a frame and has no counterpart to be connected, is located in the corresponding frame.

FIGS. 28A and 28B are diagrams illustrating the configurations of formant filters.

FIG. 29 is a table illustrating a method of modifying the cascade connection configuration of formant filters.

FIG. 30 is a chart illustrating the flow of a process of modifying the cascade connection configuration of formant filters.



DESCRIPTION OF THE PREFERRED EMBODIMENTS

A speech analysis and synthesis method based on an ARX (autoregressive-exogenous) speech production model will be summarized.

[ARX SPEECH PRODUCTION MODEL] 

The ARX speech production model is shown in FIG. 1 and is represented by the linear difference equation

$$y(n) + \sum_{k=1}^{p} a_k\, y(n-k) = \sum_{k=0}^{q} b_k\, u(n-k) + e(n) \qquad (1)$$

where the input $u(n)$ denotes a periodic voicing source waveform and the output $y(n)$ a speech signal. A glottal noise component is simulated by white noise $e(n)$. In the equation, $a_i$ and $b_i$ are vocal tract filter coefficients, and $p$ and $q$ are the ARX model orders. We define $A(z)$ and $B(z)$ as

$$A(z) = 1 + a_1 z^{-1} + \cdots + a_p z^{-p}, \qquad B(z) = b_0 + b_1 z^{-1} + \cdots + b_q z^{-q}$$

Then the z-transform of Equation (1) can be written as

$$Y(z) = \frac{B(z)}{A(z)}\,U(z) + \frac{1}{A(z)}\,E(z) \qquad (2)$$

where $Y(z)$, $U(z)$ and $E(z)$ are the z-transforms of $y(n)$, $u(n)$ and $e(n)$, respectively. The vocal tract transfer function is given by $B(z)/A(z)$.

We employ the RK (Rosenberg-Klatt) model (Klatt, D. and Klatt, L., "Analysis, synthesis, and perception of voice quality variations among female and male talkers", J. Acoust. Soc. Amer., Vol. 87, 820-857, 1990) for representing a differentiated glottal flow waveform, including radiation characteristics. The RK waveform is represented by

$$rk(n) = rk_c(nT_s) \qquad (3)$$

$$rk_c(t) = \begin{cases} 2at - 3bt^2, & 0 \le t < OQ \cdot T0 \\ 0, & \text{elsewhere} \end{cases} \qquad (4)$$

$$a = \frac{27\,AV}{4\,OQ^2\,T0}, \qquad b = \frac{27\,AV}{4\,OQ^3\,T0^2}$$

where $T_s$ is a sampling period, AV an amplitude parameter, T0 a pitch period and OQ an open quotient of the glottal open phase of the pitch period. The differentiated glottal flow waveform $u(n)$ is generated by smoothing $rk(n)$ with a low-pass filter in which the tilt of the spectral envelope is adjusted by a spectral tilt parameter TL. The low-pass filter is defined as

$$TL(z) = (1 - cz^{-1})^{-2} \qquad (5)$$

and the low-pass filter coefficient $c$ is related to the tilt parameter TL by

$$TL = 20\log_{10}|TL(e^{j0})| - 20\log_{10}|TL(e^{j\omega_0})|, \qquad c = \frac{B - \cos\omega_0 - \sqrt{(B - \cos\omega_0)^2 - (B - 1)^2}}{B - 1} \qquad (6)$$

where $B = 10^{TL/20}$ and $\omega_0 = 2\pi \cdot 3000 / F_s$.
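The relation between the tilt parameter TL and the coefficient c can be checked numerically. The following is a minimal sketch, assuming the low-pass filter has the double-pole form $(1 - cz^{-1})^{-2}$ and that the tilt is measured between 0 Hz and 3000 Hz; the function names and the handling of TL <= 0 are illustrative assumptions, not part of the described system.

```python
import cmath
import math


def tilt_coefficient(tl_db, fs, f_ref=3000.0):
    """Solve Equation (6) for c, given the tilt TL in dB between 0 Hz and f_ref."""
    if tl_db <= 0.0:
        return 0.0  # assumption: no tilt requested, so the filter reduces to 1
    big_b = 10.0 ** (tl_db / 20.0)
    w0 = 2.0 * math.pi * f_ref / fs
    x = big_b - math.cos(w0)
    # Root of (B-1)c^2 - 2(B - cos w0)c + (B-1) = 0 that lies inside the unit circle.
    return (x - math.sqrt(x * x - (big_b - 1.0) ** 2)) / (big_b - 1.0)


def tilt_db(c, fs, f_ref=3000.0):
    """Evaluate 20log10|TL(e^{j0})| - 20log10|TL(e^{j w0})| for (1 - c z^-1)^-2."""
    w0 = 2.0 * math.pi * f_ref / fs
    gain_dc = abs(1.0 - c) ** -2
    gain_w0 = abs(1.0 - c * cmath.exp(-1j * w0)) ** -2
    return 20.0 * math.log10(gain_dc / gain_w0)


c = tilt_coefficient(10.0, 11025.0)
print(c, tilt_db(c, 11025.0))
```

Feeding the computed c back into the filter response recovers the requested tilt, which is a quick way to confirm that the smaller root of the quadratic is the stable one.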
[ANALYSIS ALGORITHM]
[Estimating Filter Coefficients]

Although Ding et al. employ the Kalman filter algorithm to estimate point-by-point time-variant coefficients of the ARX model, taking articulatory movement into account, only a single set of coefficients within a pitch period has to be saved in most applications. The set of coefficients is obtained by averaging all the formant values having a bandwidth below 2,000 Hz. However, the averaged coefficients are not likely to be appropriate when a formant with a broad bandwidth is excluded from the calculation. We instead use a simple LS (least squares) method to estimate the averaged coefficients over the analysis frame.
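By defining $\varphi(n)$ and $\theta$ as

$$\varphi(n) = [\,-y(n-1) \;\cdots\; -y(n-p) \;\; u(n) \;\cdots\; u(n-q)\,]^T, \qquad \theta = [\,a_1 \;\cdots\; a_p \;\; b_0 \;\cdots\; b_q\,]^T$$

Equation (1) can be written as

$$y(n) = \varphi^T(n)\,\theta + e(n), \qquad n = 1, \ldots, N \qquad (7)$$

The prediction error becomes

$$\varepsilon(n, \theta) = y(n) - \varphi^T(n)\,\theta \qquad (8)$$

and the least-squares criterion function is

$$V(\theta) = \frac{1}{N}\sum_{n=1}^{N} \varepsilon^2(n, \theta) \qquad (9)$$

The least-squares estimates are given by

$$\hat{\theta} = \arg\min_{\theta} V(\theta) = \left[\frac{1}{N}\sum_{n=1}^{N}\varphi(n)\,\varphi^T(n)\right]^{-1} \frac{1}{N}\sum_{n=1}^{N}\varphi(n)\,y(n) \qquad (10)$$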
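The least-squares estimate of the ARX coefficients amounts to solving the normal equations built from the regression vector. The sketch below is a minimal pure-Python illustration on a synthetic, noise-free signal; the helper names `solve_linear` and `arx_least_squares`, the model orders and the test coefficients are all assumptions made for the example.

```python
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def arx_least_squares(y, u, p, q):
    """Return ([a_1..a_p], [b_0..b_q]) minimising the squared equation error."""
    dim = p + q + 1
    gram = [[0.0] * dim for _ in range(dim)]
    rhs = [0.0] * dim
    for n in range(max(p, q), len(y)):
        # Regression vector: [-y(n-1) ... -y(n-p), u(n) ... u(n-q)]
        phi = [-y[n - k] for k in range(1, p + 1)] + [u[n - k] for k in range(q + 1)]
        for i in range(dim):
            rhs[i] += phi[i] * y[n]
            for j in range(dim):
                gram[i][j] += phi[i] * phi[j]
    theta = solve_linear(gram, rhs)
    return theta[:p], theta[p:]


# Synthetic check with hypothetical coefficients (stable A(z), no noise term).
random.seed(0)
true_a, true_b = [-1.2, 0.5], [1.0, 0.4]
u = [random.gauss(0.0, 1.0) for _ in range(400)]
y = [0.0] * len(u)
for n in range(len(u)):
    y[n] = (sum(true_b[k] * u[n - k] for k in range(2) if n - k >= 0)
            - sum(true_a[k - 1] * y[n - k] for k in range(1, 3) if n - k >= 0))
a_hat, b_hat = arx_least_squares(y, u, p=2, q=1)
print(a_hat, b_hat)
```

The common 1/N factors cancel in the solution, so the sketch accumulates plain sums. In practice a numerically more robust solver (for example a QR-based least-squares routine) would likely be preferred over explicit normal equations.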



[Compensating Spectral Tilt] 

Roots of $A(z)$ and $B(z)$ on the real axis and those of very broad bandwidth must be excluded since they are not associated with the vocal tract resonance. Simple exclusion of the roots, however, alters the spectral tilt of the vocal tract transfer function. We introduce a system transfer function $D(z)$ to compensate for the spectral tilt of

$$C(z) = \frac{A'(z)\,B(z)}{A(z)\,B'(z)} \qquad (11)$$

where $B'(z)/A'(z)$ consists of the formants that are not excluded. To approximate the spectral tilt of $C(z)$, we define $D(z)$ as a second-order pole or zero on the real axis,

$$D(z) = (1 - dz^{-1})^{-2\,\mathrm{sgn}(TI)} \qquad (12)$$

where $\mathrm{sgn}(\cdot)$ represents the sign of the value. The spectral tilt parameter TI is given by

$$TI = 20\log_{10}|C(e^{j0})| - 20\log_{10}|C(e^{j\omega_0})|$$

where $\omega_0 = 2\pi \cdot 3000 / F_s$. The coefficient $d$ in Equation (12) is derived from TI in the same way as $c$ is derived from TL in Equation (6).



[Generating Voice Source] 
We generate a multiple-pulse source signal of arbitrary length in order to obtain more stable estimates of the formants. The multiple-pulse source signal $v(n)$ is given by

$$v(n) = \sum_{i} rk\big(n - OQ \cdot T0 \cdot F_s + GCI(i),\; AV(i),\; T0,\; OQ\big) \qquad (13)$$

T0 is the average of the pitch periods in the analysis frame. The initial value of OQ is set to an appropriate value. The voicing amplitude parameter $AV(i)$ and the glottal closure instant $GCI(i)$ are obtained from excitation peaks of the inverse-filtered speech $v'(n)$, whose z-transform is given as

$$V'(z) = \left(\frac{A'(1)}{B'(1)\,D(1)\,TL(1)}\right)^{-1} \frac{A'(z)}{B'(z)\,D(z)\,TL(z)}\; Y(z) \qquad (14)$$

The excitation amplitude AE of $v'(n)$ is converted to the AV parameter by

$$AV = \frac{4}{27}\, OQ \cdot AE \qquad (15)$$
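The conversion from the excitation amplitude AE to AV follows from the negative peak of the RK pulse at the end of the open phase. The sketch below samples one pitch period of the RK waveform and recovers AV from its negative peak; it assumes the coefficients a = 27AV/(4 OQ^2 T0) and b = 27AV/(4 OQ^3 T0^2), and the function names and parameter values are illustrative only.

```python
def rk_pulse(av, t0, oq, fs):
    """Sample one pitch period of the RK differentiated glottal flow waveform."""
    a = 27.0 * av / (4.0 * oq ** 2 * t0)
    b = 27.0 * av / (4.0 * oq ** 3 * t0 ** 2)
    n_samples = int(round(t0 * fs))
    pulse = []
    for n in range(n_samples):
        t = n / fs
        # Open phase: 2at - 3bt^2; closed phase: 0.
        pulse.append(2.0 * a * t - 3.0 * b * t * t if t < oq * t0 else 0.0)
    return pulse


def av_from_ae(ae, oq):
    """Equation (15): convert the excitation amplitude AE to the AV parameter."""
    return 4.0 / 27.0 * oq * ae


# Hypothetical values: 100 Hz pitch, OQ = 0.6, finely sampled so that the
# sampled negative peak is close to the continuous-time one.
pulse = rk_pulse(av=1.0, t0=0.01, oq=0.6, fs=100000.0)
ae = -min(pulse)
print(ae, av_from_ae(ae, 0.6))
```

At lower sampling rates the sampled peak can fall short of the continuous-time peak, which is one motivation for the sub-sample GCI handling described in the second embodiment.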
[Adaptive Prefilter]

Equation (9) can be expressed in the frequency domain using Parseval's relationship as follows (Ljung, L., "System identification theory for the user." PRENTICE HALL PTR, 201-202, 1995)

$$V(\theta) = \frac{1}{N}\sum_{k=0}^{N-1} \left| \hat{G}\big(e^{j\frac{2\pi}{N}k}\big) - \frac{B\big(e^{j\frac{2\pi}{N}k}, \theta\big)}{A\big(e^{j\frac{2\pi}{N}k}, \theta\big)} \right|^2 \left| W\big(\tfrac{2\pi}{N}k, \theta\big) \right|^2 \qquad (16)$$

where

$$\hat{G}\big(e^{j\frac{2\pi}{N}k}\big) = \frac{Y_N(k)}{U_N(k)}, \qquad W(\omega, \theta) = U(\omega)\,A(e^{j\omega}, \theta) \qquad (17)$$

and $Y_N(k)$ and $U_N(k)$ are the $N$-point DFTs of $y(n)$ and $u(n)$. From Equation (16), the prediction-error method can be interpreted as a method of fitting the model vocal tract transfer function to the empirical transfer-function estimate (ETFE) $\hat{G}$ with weighting function $W(\omega, \theta)$.

If the input signal and the output signal of the system are prefiltered with $L(z)$

$$L(z) = 1 + l_1 z^{-1} + l_2 z^{-2} + \cdots + l_r z^{-r} \qquad (18)$$

the weighting function can be rewritten as

$$W(\omega, \theta) = U(\omega)\,A(e^{j\omega}, \theta)\,L(e^{j\omega}) \qquad (19)$$

which implies that $W(\omega, \theta)$ can be controlled by a prefilter $L(z)$. In the ARX speech production model, the spectral tilt of the voicing source $U(\omega)$ is determined by TL, and the spectral tilt of $A(e^{j\omega})$ is assumed to be flat over a wide frequency range although $A(e^{j\omega})$ has anti-resonances in local frequency ranges. Ding et al. ignored the effects of the spectral tilt parameter TL and used an invariant filter $L(z)$, such as $L(z) = 1 - z^{-1}$.

We employ an adaptive prefilter $L(z)$ that takes the effects of TL into account in order to cancel out $U(\omega)$ in the weighting function $W(\omega, \theta)$. The coefficients of the prefilter $L(z)$ are obtained from the following AR model using the LS method,

$$u(n) = -\sum_{k=1}^{r} l_k\, u(n-k) + \xi(n) \qquad (20)$$

where the model order $r$ is typically 6 or 8, and $\xi(n)$ is white noise.

[Estimating Open Quotient] 

Open quotient OQ of the RK model is primarily related to the first harmonic level (H1) and the second harmonic level (H2) of the multi-pulse source, as shown in FIG. 2. OQ is given by the following equation,

$$OQ = 3.65\,HD - 0.273\,HD^2 + 0.0224\,HD^3 + 50.7, \qquad -4.03 \le HD \le 9.83 \qquad (21)$$

where HD = H2 - H1 [dB], and H2 and H1 are obtained from the DFT of the inverse-filtered speech given by Equation (14).
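The open quotient estimate can be sketched in a few lines: measure the levels of the first two harmonics and evaluate the cubic polynomial in HD. The code below is an illustration under stated assumptions: the quadratic coefficient is taken as 0.273 (as in the Summary), HD is clamped to the stated validity range (a choice made here, not something the text prescribes), and the helper names are hypothetical. The constant term 50.7 suggests the result is a percentage.

```python
import cmath
import math


def harmonic_level_db(x, fs, freq):
    """Level in dB of the component of x at `freq` Hz (single-bin DFT sum)."""
    acc = sum(v * cmath.exp(-2j * math.pi * freq * n / fs) for n, v in enumerate(x))
    return 20.0 * math.log10(abs(acc) + 1e-12)


def open_quotient(hd_db):
    """Cubic relation between HD = H2 - H1 [dB] and OQ, with HD clamped."""
    hd = min(max(hd_db, -4.03), 9.83)
    return 3.65 * hd - 0.273 * hd ** 2 + 0.0224 * hd ** 3 + 50.7


# Synthetic source with F0 = 100 Hz whose second harmonic is 6.02 dB below the first.
fs, f0 = 8000.0, 100.0
x = [math.cos(2 * math.pi * f0 * n / fs) + 0.5 * math.cos(2 * math.pi * 2 * f0 * n / fs)
     for n in range(800)]
hd = harmonic_level_db(x, fs, 2 * f0) - harmonic_level_db(x, fs, f0)
print(hd, open_quotient(hd))
```

Evaluating the DFT only at F0 and 2·F0 avoids a full transform; it assumes F0 is already known from the pitch analysis.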



[SYNTHESIS ALGORITHM]

A cascade formant synthesizer is used to synthesize both voiced and unvoiced speech. The RK model is used to synthesize voiced speech, whereas an M-sequence, a pseudo-random binary signal, is used to synthesize unvoiced speech.
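An M-sequence (maximum-length sequence) can be produced with a linear feedback shift register. The following sketch is only an illustration of the idea; the register length, the feedback taps (bits 0 and 1, corresponding to the polynomial x^n + x + 1) and the mapping to +1/-1 are assumptions, since the text does not specify which sequence is used.

```python
def m_sequence(nbits, length, seed=1):
    """Generate `length` samples (+1.0 / -1.0) from an nbits-stage LFSR.

    The feedback is the XOR of the two lowest bits, i.e. the recurrence
    a[n + nbits] = a[n] ^ a[n + 1]. For nbits = 4 this yields a
    maximum-length sequence with period 2**4 - 1 = 15.
    """
    mask = (1 << nbits) - 1
    state = seed & mask
    if state == 0:
        raise ValueError("seed must be non-zero")
    out = []
    for _ in range(length):
        out.append(1.0 if state & 1 else -1.0)
        feedback = (state ^ (state >> 1)) & 1
        state = (state >> 1) | (feedback << (nbits - 1))
    return out


noise = m_sequence(4, 30)
print(noise[:15])
```

A register this short repeats every 15 samples and is shown only because its period is easy to verify by hand; a practical unvoiced source would use a much longer register so that the repetition is inaudible.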






[Voicing Source Control] 

We apply two all-pass filters (APF) (Kawahara, H., "Speech representation and transformation using adaptive interpolation of weighted spectrum: vocoder revisited", ICASSP 97, 1303-1306, 1997) to the RK voicing source in order to solve two problems:

- Since the interval between two successive glottal closure instants (GCIs) can be considered the cue for human perception of F0, we have to carefully control the position of the RK waveform.

- Since a constant sequence of the voicing source waveform causes a buzzy sound quality, certain fluctuations must be introduced into the source waveform.

The improved voicing source $rk'(n)$ is given by the following equations.

$$rk'(n) = \frac{1}{N}\sum_{k=0}^{N-1} R'\big(\tfrac{2\pi}{N}k\big)\, e^{j\frac{2\pi}{N}kn} \qquad (22)$$

$$R'\big(\tfrac{2\pi}{N}k\big) = R\big(\tfrac{2\pi}{N}k\big)\, e^{-j(\theta_d(k) + \theta_r(k))} \qquad (23)$$

where $R(2\pi k/N)$ is the DFT of Equation (3). The phase $\theta_d(k)$ shifts the voicing source waveform by $T_d$ [sec],

$$\theta_d(k) = \frac{2\pi}{N}\,k\,T_d\,F_s \qquad (24)$$

$\theta_r(k)$, on the other hand, randomizes the group delay in the higher frequency range,

$$\theta_r(k) = \begin{cases} \dfrac{2\pi}{N}\displaystyle\sum_{l=0}^{k} \eta(l)\,w_c(l), & k = 0, \ldots, \tfrac{N}{2} \\[2ex] -\theta_r(N-k), & k = \tfrac{N}{2}+1, \ldots, N-1 \end{cases} \qquad (25)$$

$$\eta(l) \sim N(0,\, d_g F_s), \qquad l = 0, \ldots, \tfrac{N}{2}$$

The group delay $\eta(l)$ is white noise with zero mean and variance $d_g F_s$ [point]. The weighting window $w_c(l)$ is used to confine the phase manipulation to the high-frequency range defined by a cutoff frequency $\omega_c$ [rad] (typically $2\pi \cdot 100 / F_s$). An example is shown in FIG. 3.

[Optimum Formant Connection] 

The automatic estimation described above does not always guarantee that the coefficients of the vocal tract transfer function will vary continuously. In the formant synthesizer, which is a time-variant system, discontinuity of the digital filter coefficients causes click sounds. Discontinuity occurs in two cases: 1) if the number of formants differs between two successive frames, and 2) if a formant frequency changes abruptly.

Dynamic programming is applied to attain an optimum match between the formants $F(n)$ and $F(n+1)$ with a distance measure consisting of a connection cost $d_c(F(n), F(n+1))$ and a disconnection cost $d_d(F(k))$.

$$d_c(F(n), F(n+1)) = \alpha\,|F_f(n) - F_f(n+1)| + \beta\,|F_i(n) - F_i(n+1)| \qquad (26)$$

$$d_d(F(k)) = \alpha\,|F_f(k) - F_f(k)| + \beta\,|F_i(k) - \varepsilon| = \beta\,|F_i(k) - \varepsilon| \qquad (27)$$

where $F_f$ is the formant frequency and $F_i$ is the formant intensity. The formant intensity $F_i$ is defined as the difference between the maximum and minimum levels of the spectrum of the formant,

$$F_i(n) = \begin{cases} 20\log_{10}\dfrac{1 + e^{-\pi F_b(n)/F_s}}{1 - e^{-\pi F_b(n)/F_s}}, & \text{if formant} \\[2ex] 20\log_{10}\dfrac{1 - e^{-\pi F_b(n)/F_s}}{1 + e^{-\pi F_b(n)/F_s}}, & \text{if anti-formant} \end{cases} \qquad (28)$$

where $F_b(n)$ is the formant bandwidth and $F_s$ is the sampling frequency.

When a formant does not have a counterpart, a formant with the same frequency and an intensity of a small value $\varepsilon$ is regarded as the formant to be connected. The results of a simulation of optimum formant connections show that spectral envelopes vary smoothly even if the formant frequency varies rapidly, as seen in FIG. 4.
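The formant matching described above can be sketched as a small dynamic-programming alignment in which a diagonal move connects two formants and a horizontal or vertical move leaves one formant disconnected. This is a minimal illustration, not the path search of the embodiments: the specific shift constraints on the grid are not modeled, and the weights `alpha`, `beta`, the value `eps` and the example formants are hypothetical.

```python
import math


def formant_intensity(bandwidth_hz, fs, anti=False):
    """Peak-to-valley level in dB of a resonance with pole radius exp(-pi*B/Fs)."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    level = 20.0 * math.log10((1.0 + r) / (1.0 - r))
    return -level if anti else level


def match_formants(frame_a, frame_b, alpha=0.1, beta=1.0, eps=0.0):
    """Align two frames of (frequency, intensity) pairs sorted by frequency.

    Returns (total_cost, pairs); pairs holds (i, j) for connected formants and
    (i, None) or (None, j) for formants left without a counterpart.
    """
    def d_c(fa, fb):  # connection cost
        return alpha * abs(fa[0] - fb[0]) + beta * abs(fa[1] - fb[1])

    def d_d(f):  # disconnection cost
        return beta * abs(f[1] - eps)

    na, nb = len(frame_a), len(frame_b)
    inf = float("inf")
    cost = [[inf] * (nb + 1) for _ in range(na + 1)]
    step = [[None] * (nb + 1) for _ in range(na + 1)]
    cost[0][0] = 0.0
    for i in range(na + 1):
        for j in range(nb + 1):
            if i > 0 and j > 0:
                c = cost[i - 1][j - 1] + d_c(frame_a[i - 1], frame_b[j - 1])
                if c < cost[i][j]:
                    cost[i][j], step[i][j] = c, "connect"
            if i > 0:
                c = cost[i - 1][j] + d_d(frame_a[i - 1])
                if c < cost[i][j]:
                    cost[i][j], step[i][j] = c, "drop_a"
            if j > 0:
                c = cost[i][j - 1] + d_d(frame_b[j - 1])
                if c < cost[i][j]:
                    cost[i][j], step[i][j] = c, "drop_b"
    pairs, i, j = [], na, nb
    while i > 0 or j > 0:  # trace the optimum path back to the origin
        if step[i][j] == "connect":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step[i][j] == "drop_a":
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return cost[na][nb], pairs


frame_a = [(500.0, 20.0), (1500.0, 15.0), (2500.0, 10.0)]
frame_b = [(520.0, 19.0), (2450.0, 11.0)]
total, pairs = match_formants(frame_a, frame_b)
print(total, pairs)
```

In this example the middle formant of the first frame has no counterpart, so the cheapest path connects the outer two formants and pays the disconnection cost once. The relative size of `alpha` and `beta` decides when a frequency jump becomes more expensive than dropping a formant.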

[EXPERIMENTS] 

A long Japanese sentence read by 18 males and 5 females was subjected to analysis-synthesis experiments. The 18 talkers were selected from a speech data corpus of 108 males prepared for research on voice quality variations associated with talker individuality, and were regarded as sufficiently representative of the original 108 males in terms of voice quality variations (Ljung, L., "System Identification theory for the user." PRENTICE HALL PTR, 201-202, 1995). After confirming the superiority of the proposed method over that of Ding et al. in synthetic sound quality, a further comparison was made between the well-known mel-cepstral (MCEP) method (Tokuda, K., Matsumura, H., and Kobayashi, T., "Speech coding based on adaptive mel-cepstral analysis", ICASSP 94, 197-200, 1994) and our ARX method. The same speech samples as in the previous experiment were used. The sampling frequency for digitization was 11.025 kHz. In order to test robustness against pitch conversion, speech samples were also re-synthesized with a fundamental frequency 1.5 times that of the original. A paired comparison test was conducted with five subjects, who were asked to choose the more natural-sounding synthetic stimulus. The results are illustrated in FIG. 5, where statistics are given for the two pitch groups, low and high pitch, and for pitch conversion. Although the difference between the ARX and MCEP methods is small for the low-pitch speech data, it is clear that the ARX method works much better for high-pitch voices.



EMBODIMENT 1 

FIG. 6 is a block diagram illustrating the configuration of a speech analysis system according to a first embodiment of the present invention. This system operates in accordance with the flow shown in FIG. 7. Hereinafter, how the system operates will be described with reference to FIGS. 6 and 7.

A speech segment 601 is cut out from a speech waveform to be analyzed using a window function with a window length of about 25-35 msec. The well-known Hanning window or the like is used as the window function. A window length of 25-35 msec is considerably longer than those used in conventional analysis methods, and corresponds approximately to the total length of several pitch periods of a speech waveform having a normal pitch range for male or female speech.
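Cutting out the speech segment can be sketched as follows. This is a minimal illustration assuming a symmetric Hanning window, a 30 msec length and zero padding beyond the signal ends; the function names and these choices are assumptions made for the example.

```python
import math


def hann_window(length):
    """Symmetric Hanning window: 0.5 - 0.5*cos(2*pi*n/(length-1))."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1)) for n in range(length)]


def cut_segment(signal, center, fs, window_ms=30.0):
    """Return the windowed segment of `signal` centred on sample index `center`."""
    length = int(round(window_ms * 1e-3 * fs))
    start = center - length // 2
    win = hann_window(length)
    return [(signal[start + n] if 0 <= start + n < len(signal) else 0.0) * win[n]
            for n in range(length)]


# At 11.025 kHz a 30 msec window spans 331 samples, i.e. about three pitch
# periods of a 100 Hz voice or six periods of a 200 Hz voice.
segment = cut_segment([1.0] * 2000, center=1000, fs=11025.0)
print(len(segment), segment[0], segment[len(segment) // 2])
```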



<Step S7001>

Next, in a GCI searching unit 602 and an AV estimation unit 603, peak picking in the negative direction is performed to obtain GCIs (glottal closure instants) and an initial value for AV from the speech segment 601. As the GCIs, the negative peak positions of the speech segment 601 are used. AV is obtained using Equation (15), as shown in FIG. 8, so that the peak values of the speech segment 601 correspond to the negative peaks of an RK voicing source.

<Step S7002> 

Next, in a voicing source waveform generating unit 604, the RK model waveform shown in FIG. 1 and Equation (3) is generated so that its negative peak positions are synchronized with the GCIs, thereby generating a voicing source waveform 605. In this case, the parameters used for the RK model are the value obtained in Step S7001 for AV, 0.6 for $OQ^0$, which is the initial value of OQ, and an appropriate value between 5 and 15 for $TL^0$. T0 is the average pitch period in the frame to be analyzed. The voicing source waveform generating unit 604 generates the voicing source waveform 605 according to Equation (13).

<Step S7003> 

Next, in an AR analysis unit 606, the voicing source waveform 605 generated by the voicing source waveform generating unit 604 is AR analyzed. A model order of 6 or 8 is used for the AR analysis. Adaptive pre-emphasis is performed on the voicing source waveform 605 and the speech segment 601 by adaptive pre-emphasis filters 607 and 608, using the filter coefficients obtained by the AR analysis. The adaptive pre-emphasis filters 607 and 608 can be represented by Equation (18).



<Step S7004> 

Next, in an ARX analysis unit 609, ARX analysis is conducted using the voicing source waveform 605 and the speech segment 601, which have been adaptively pre-emphasized by the adaptive pre-emphasis filters 607 and 608, in the manner shown in Equations (7) through (10). As a result, AR coefficients $a_i$ and MA coefficients $b_i$ are obtained from Equation (10), and thereby $A(z)$ and $B(z)$ in Equation (2) are determined. Then, by solving the equations $A(z) = 0$ and $B(z) = 0$, a formant frequency $F_f(n)$, a formant bandwidth $F_b(n)$, an anti-formant frequency $AF_f(n)$, and an anti-formant bandwidth $AF_b(n)$ are obtained. That is to say, if the complex solutions of $A(z) = 0$ are represented by $r_1, \ldots, r_p$, and the complex solutions of $B(z) = 0$ are represented by $s_1, \ldots, s_q$, then $F_f(n)$, $F_b(n)$, $AF_f(n)$, and $AF_b(n)$ can be obtained from the following equations:

$$F_f(n) = \frac{F_s}{2\pi}\arg r_i, \qquad F_b(n) = -\frac{F_s}{\pi}\ln|r_i|, \qquad AF_f(n) = \frac{F_s}{2\pi}\arg s_i, \qquad AF_b(n) = -\frac{F_s}{\pi}\ln|s_i|$$

where the following equations are given.

$$\arg c = \arctan\frac{\mathrm{Im}(c)}{\mathrm{Re}(c)}, \qquad |c| = \sqrt{\mathrm{Re}^2(c) + \mathrm{Im}^2(c)}$$

These equations express the complex number $c$ in polar coordinates, as shown in FIG. 9.
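The conversion between a root of A(z) or B(z) and a formant frequency and bandwidth can be sketched as follows. The sketch assumes the usual pole-to-formant relations, frequency = Fs·arg(r)/(2π) and bandwidth = -Fs·ln|r|/π, which are consistent with the pole radius exp(-π·F_b/F_s) appearing in the formant intensity formula; the function names are hypothetical.

```python
import cmath
import math


def root_to_formant(root, fs):
    """Map a complex root to (centre frequency, bandwidth) in Hz."""
    # cmath.phase uses atan2, so it stays valid when Re(root) <= 0,
    # where a plain arctan(Im/Re) would pick the wrong quadrant.
    freq = fs / (2.0 * math.pi) * abs(cmath.phase(root))
    bandwidth = -fs / math.pi * math.log(abs(root))
    return freq, bandwidth


def formant_to_root(freq, bandwidth, fs):
    """Inverse mapping: build the upper-half-plane root for a given formant."""
    radius = math.exp(-math.pi * bandwidth / fs)
    return cmath.rect(radius, 2.0 * math.pi * freq / fs)


root = formant_to_root(1200.0, 80.0, 11025.0)
print(root, root_to_formant(root, 11025.0))
```

Roots come in complex-conjugate pairs for real coefficients, so taking the absolute value of the phase maps both members of a pair to the same formant.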

Note that formants with a broad bandwidth are excluded here. Excluding formants affects the estimated spectral tilt, and thus TI is estimated in the manner shown in Equations (11) and (12).



<Step S7005> 

Next, an inverse filter 610 shown in Equation (14) is constructed using the formant parameters $F_f(n)$ and $F_b(n)$, the spectral tilt TI, and the voicing source spectral tilt $TL^0$, which have been estimated, and then a voicing source waveform 611 is estimated from the speech segment 601.



<Step S7006> 

Next, in an OQ estimation unit 612, OQ is estimated. Specifically, HD = H2 - H1, which is the difference between the first harmonic level H1 and the second harmonic level H2, is obtained from the DFT (discrete Fourier transform) of the voicing source waveform 611 estimated by the inverse filter 610, and OQ is then estimated using Equation (21).

<Step S7007> 

Next, in the GCI searching unit 602 and the AV estimation unit 603, peak picking in the negative direction is performed on the voicing source waveform 611 estimated by the inverse filter 610, and thereby GCIs and a value for AV are obtained from the voicing source waveform 611. The GCIs and AV are obtained in the same manner as described in Step S7001.

<Step S7008>

Next, in a determination unit 613, it is determined whether the GCIs have converged to predetermined values. If the GCIs have not converged, the process repeats the estimation from Step S7002. If the GCIs have converged, the process completes the analysis of the current frame and proceeds to the analysis of the next frame. Note that the frame period is preferably 5-10 ms.

As has been described, in the speech analysis system according to the first embodiment, voicing source parameters and a vocal tract transfer function can be estimated with high accuracy, even from female speech with a high pitch frequency or the like, by setting the analysis window length to about 25-35 msec, which is longer than that in a conventional system, and then estimating the voicing source positions of multiple pitch periods at a time.



EMBODIMENT 2 

In the first embodiment, GCI estimation by peak picking in Steps S7001 and S7007 is performed at a resolution of one sample. In a second embodiment of the present invention, GCI estimation is carried out at a time resolution finer than the sampling period, and an RK voicing source waveform highly accurately synchronized with the GCIs is generated in Step S7002, resulting in improved analysis accuracy.
A method for highly accurate GCI estimation is shown in FIG. 10. Negative peak positions of the speech segment 601, or of the voicing source waveform 611 estimated by the inverse filter 610, are accurately obtained by quadratic interpolation. Specifically, a peak 8001 is detected at sample resolution, a quadratic function 8004 is obtained whose graph passes through three points, namely the peak 8001, its previous sample 8002 and its subsequent sample 8003, and a peak position 8005 and a peak value 8006 of the quadratic function 8004 are then obtained.
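The quadratic interpolation step can be sketched in a few lines: fit a parabola through the detected peak sample and its two neighbours, then take the vertex. The function name and the test signal below are illustrative assumptions.

```python
def refine_peak(x, i):
    """Fit a parabola through x[i-1], x[i], x[i+1]; return (position, value) of its vertex."""
    y0, y1, y2 = x[i - 1], x[i], x[i + 1]
    denom = y0 - 2.0 * y1 + y2
    if denom == 0.0:
        return float(i), y1  # the three points are collinear: keep the sample peak
    delta = 0.5 * (y0 - y2) / denom  # vertex offset from sample i, in samples
    return i + delta, y1 - 0.25 * (y0 - y2) * delta


# A negative peak located between samples, at t = 5.3 with value -2.
x = [(t - 5.3) ** 2 - 2.0 for t in range(10)]
i = min(range(1, len(x) - 1), key=lambda n: x[n])
position, value = refine_peak(x, i)
print(i, position, value)
```

The same formula works for positive and negative peaks, since only the curvature of the fitted parabola changes sign.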

The peak position 8005 obtained in this manner is a GCI, but its value is represented not by an integer sampling position but by a real number. In order to align the negative peak positions of the RK voicing source model with GCI positions represented by real numbers, the RK voicing source model is time-shifted using all-pass filters. In other words, the RK voicing source model waveform corresponding to one pitch period is shifted according to Equations (22) through (24). Note that $\theta_r(k) = 0$ is applied here. $T_d$ in Equation (24) may be replaced with the time difference between the estimated peak position 8005 and the sample position 8002 located right before the peak position 8005, both of which are shown in FIG. 10.

FIG. 11A shows an exemplary RK voicing source waveform being shifted, by means of the all-pass filters, at a time resolution finer than the sampling period. In the graph shown in FIG. 11B, an original RK voicing source waveform, a 0.5 point-shifted RK voicing source waveform, and a 0.9 point-shifted RK voicing source waveform are shown overlapping one another. In this manner, by synchronizing the negative peak positions of the RK model waveform with the GCIs at a time resolution finer than the sampling period, the analysis accuracy can be improved.
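A sub-sample shift of this kind can be illustrated with a linear phase term applied in the DFT domain, which acts as an all-pass operation. The following pure-Python sketch is an assumption-laden illustration: it uses a direct O(N^2) DFT for clarity, treats the waveform as periodic (circular shift), and the function name is hypothetical. A delay of $T_d$ seconds corresponds to `delay = T_d * F_s` samples.

```python
import cmath
import math


def fractional_shift(x, delay):
    """Circularly delay x by `delay` samples (may be non-integer) via a linear phase term."""
    n = len(x)
    spec = [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]
    out = []
    for m in range(n):
        acc = 0j
        for k in range(n):
            kk = k if k <= n // 2 else k - n  # signed bin index keeps the output real
            acc += spec[k] * cmath.exp(2j * math.pi * kk * (m - delay) / n)
        out.append(acc.real / n)
    return out


# Shift a sinusoid by half a sample and compare with the analytic result.
n = 64
x = [math.sin(2.0 * math.pi * 3 * m / n) for m in range(n)]
shifted = fractional_shift(x, 0.5)
expected = [math.sin(2.0 * math.pi * 3 * (m - 0.5) / n) for m in range(n)]
print(max(abs(s - e) for s, e in zip(shifted, expected)))
```

For a non-integer delay, any energy in the Nyquist bin is only scaled rather than shifted, so the input is assumed to be band-limited below half the sampling frequency.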

As has been described above, in the speech analysis system according to the second embodiment, in estimating voicing source positions, the negative peak positions of a speech segment, or of a voicing source waveform estimated by an inverse filter, are accurately obtained by quadratic interpolation, and the RK voicing source model is then time-shifted by all-pass filters so that its negative peak positions are aligned with the negative peak positions of the speech segment or the voicing source waveform. This allows a highly accurate estimation of GCIs, resulting in increased accuracy in estimating voicing source parameters and a vocal tract transfer function.
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EMBODIMENT 3 

FIG. 12 is a block diagram illustrating the configuration
of a speech synthesis system according to a third embodiment of
the present invention. The speech synthesis system generates a
synthesized speech in accordance with Equation (2), and includes
an RK model voicing source generation unit 12001, a voicing source
spectral tilt filter (TL(z)) 12002, a vocal tract spectral tilt
filter (D(z)) 12003, a vocal tract filter (B(z)/A(z)) 12004, a
white noise generation unit 12005, a white noise filter (1/A(z))
12006 and a mixing unit 12007.

A speech which has been analyzed by the speech analysis
system according to the first or second embodiment of the present
invention is represented by the following parameters for each
analyzed frame, and is then transmitted to the speech synthesis
system.

Types of Parameters                 Name      Meaning
----------------------------------  --------  -------------------------------------------------
Voicing source parameter            AV        Amplitude of RK voicing source model
                                    OQ        Vocal cord open quotient of RK voicing source model
                                    F0        Fundamental frequency of RK voicing source model
                                    TL        Spectral tilt rate
                                    NA        Amplitude of white noise
Spectral tilt compensation filter   TI        Spectral tilt compensation rate
Formant                             F1 to F6  Center frequencies of 1st through 6th formants
                                    B1 to B6  Bandwidths of 1st through 6th formants

AV takes a value other than 0 only in voiced parts and 0 in
voiceless parts. On the other hand, NA takes a value other than 0
only in voiceless parts and 0 in voiced parts.

The RK model voicing source generation unit 12001 uses the
parameters AV, OQ and F0 to generate a voicing source waveform
according to Equation (13). The voicing source spectral tilt
filter 12002 uses the parameter TL to modify the spectral tilt
of the voicing source waveform from the RK model voicing source
generation unit 12001 according to Equation (5). The vocal tract
spectral tilt filter 12003 uses the parameter TI to compensate
the spectral tilt according to Equation (12). The voicing source
waveform whose spectral tilt has been compensated by the vocal
tract spectral tilt filter 12003 is supplied to the mixing unit
12007 via the vocal tract filter 12004. Specifically, the voicing
source waveform according to the first term of the right side of
Equation (2) is supplied to the mixing unit 12007. The white noise
generation unit 12005 generates random noise at a gain dependent
on the parameter NA. The random noise generated by the white
noise generation unit 12005 is supplied to the mixing unit 12007
via the white noise filter 12006. Specifically, a noise waveform
according to the second term of the right side of Equation (2)
is supplied to the mixing unit 12007. The mixing unit 12007
mixes the voicing source waveform from the vocal tract filter
12004 and the noise waveform from the white noise filter 12006
and thereby generates a synthesized speech signal according to
Equation (2).

As has been described above, in the speech synthesis system
according to the third embodiment, it is possible to synthesize
a speech with high sound quality which sounds very close to the
original speech sound by separately synthesizing parameters, which
have been estimated by the speech analysis system according to
the first and the second embodiments, for each frame.

EMBODIMENT 4 

A speech synthesis system according to a fourth embodiment
of the present invention includes an RK model voicing source
generation unit 13001 shown in FIG. 13 instead of the RK model
voicing source generation unit 12001 shown in FIG. 12. The other
structures are the same as in the speech synthesis system
shown in FIG. 12. The RK model voicing source generation
unit 13001 shown in FIG. 13 includes an RK model voicing source
generation unit 12001, a DFT (discrete Fourier transformation)
calculation unit 13002, a DFT modification unit 13003, an IDFT
(inverse discrete Fourier transformation) calculation unit 13004,
a stationary delay calculation unit 13005, a random delay
calculation unit 13006 and a synthesis unit 13007.

The RK model voicing source generation unit 12001 is
equivalent to the one shown in FIG. 12. The DFT calculation unit 13002
performs DFT on the voicing source waveform from the RK model voicing
source generation unit 12001 to transform it into a frequency domain
according to Equation (23). The stationary delay calculation unit 13005
uses the parameter F0 to calculate the stationary delay according to
Equation (24). The random delay calculation unit 13006 calculates
the random delay $\theta_r(k)$ according to Equation (25). The synthesis
unit 13007 adds the stationary delay to the random delay
$\theta_r(k)$ and then supplies the sum to the DFT modification
unit 13003. The DFT modification unit 13003 modifies the voicing
source waveform, which is now in a frequency domain, from the DFT
calculation unit 13002 according to the second equation of Equation
(22). The IDFT calculation unit 13004 performs IDFT on the voicing
source waveform in the frequency domain, which has been modified
by the DFT modification unit 13003, to return the voicing source
waveform to a time domain according to the first equation of Equation
(22).

By adding a fluctuation to the speech segment in the manner
as described above, it is possible to:

1) accurately control glottal closure timing; and 

2) prevent buzzy sound quality. 

EMBODIMENT 5

FIG. 14 is a block diagram illustrating the configuration
of a speech synthesis system according to a fifth embodiment of
the present invention. The speech synthesis system illustrated
in FIG. 14 further includes a formant connection unit 14001 in
addition to the members of the configuration of the speech synthesis
system shown in FIG. 12. The formant connection unit 14001
optimizes formant connections taking into account the continuity
of the formant parameters F1 through F6 and B1 through B6 between
adjacent frames. The formant connection unit 14001 determines
the correspondence of formants between the frames by dynamic
programming using a connection cost and a disconnection cost shown
in Equations (26) and (27).

Hereinafter, the dynamic programming operation will be
described in detail.

FIG. 15 illustrates the formant frequencies and bandwidths
of two adjacent frames. The abscissa indicates frame numbers and
the ordinate indicates frequencies. The frequency and the
bandwidth of each formant are indicated as values in the form
(Frequency, Bandwidth). The two frames (Frame A and Frame B) have
six formants each. The formants in each frame are called F1,
F2 and so on in order of increasing frequency. Normally,
among these sets of six formants, the ones with the same number in
Frame A and Frame B are connected with each other. However, the
frequencies of F2 and F3 in Frame B are close to each other, and both
are close to the frequency of F2 in Frame A. Also, the bandwidth
of F2 in Frame B takes a considerably large value. A formant with
a broad bandwidth is low in intensity, and thus such a formant is
considered to be one that is disappearing or appearing. Accordingly,
F2 in Frame B is considered to be one that is appearing, and it is
therefore desirable that F2 in Frame B not be connected with F2
in Frame A. In this case, F2 in Frame A should be connected with
F3 in Frame B. Dynamic programming is used for automatically
making this kind of determination.

In FIG. 16, the formants in Frame A are plotted on the abscissa and
the formants in Frame B on the ordinate, and grid points are indicated
therein by coordinates (1,1), (1,2) and the like. In the figure,
each formant is given its values of frequency and intensity in
the form (frequency, intensity). The intensity of each formant
is represented by a value obtained by transforming the bandwidth
thereof according to Equation (28).

The two frames have six formants each and therefore there are
36 grid points, from (1,1) through (6,6). In addition, an
extra point (7,7) is given in the figure. Assume that a path
extends from (1,1) toward (7,7), passing through grid points. For
example, as shown in FIG. 17, a path which passes through the points
(1,1), (2,2), (3,3), (4,4), (5,5), (6,6) and (7,7) can be drawn.
In this case, the point (1,1) corresponds to F1 in Frame A and
F1 in Frame B, and likewise for the point (2,2) and the subsequent ones.
Accordingly, when the path described above is drawn, the six
formants from F1 through F6 are all connected with their
counterparts with the same number. However, as shown in FIG. 18,
for example, a path which passes through the points (1,1), (2,3),
(3,4), (5,5), (6,6) and (7,7) can also be drawn. This means that
F2 in Frame A and F3 in Frame B are connected and that F3 in Frame
A and F4 in Frame B are connected. F4 in Frame A and F2 in Frame
B do not have counterparts to be connected with. It is considered
that F4 in Frame A is a disappearing formant and F2 in Frame B
is an appearing one.

As has been described above, the formant connection is determined
by which path pattern is selected. The path pattern is selected
so as to minimize the total of a cost based on
the distance between formant frequencies and the distance between
formant intensities and a cost based on a shift from one grid point
to another.

First, as shown in FIG. 19, shifts are constrained.
Specifically, assume that a shift to the point (i,j) is allowed
only from the four points (i-1,j-1), (i-2,j-1), (i-1,j-2) and
(i-2,j-2). A shift from (i-1,j-1) is called A, a shift from
(i-2,j-1) is called B, a shift from (i-1,j-2) is called C and a
shift from (i-2,j-2) is called D. Under these constraints, the grid
points through which a path that starts at (1,1) and ends at (7,7)
can pass are restricted to the ones shown in FIG. 20 among all the
grid points.
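
The effect of the shift constraints can be checked with a short sketch. It enumerates the grid points that are reachable from (1,1) and from which the end point can still be reached using only shifts A through D. The function name `possible_grid_points` is hypothetical, and the result is computed from the constraints alone rather than read off FIG. 20.

```python
def possible_grid_points(na, nb):
    """Grid points usable by a path from (1, 1) to (na + 1, nb + 1)."""
    steps = [(1, 1), (2, 1), (1, 2), (2, 2)]  # shifts A, B, C and D
    end = (na + 1, nb + 1)

    def reachable(start, sign):
        # Depth-first search over the grid; sign=+1 walks forward from the
        # start point, sign=-1 walks backward from the end point.
        seen = {start}
        stack = [start]
        while stack:
            i, j = stack.pop()
            for di, dj in steps:
                p = (i + sign * di, j + sign * dj)
                if 1 <= p[0] <= na + 1 and 1 <= p[1] <= nb + 1 and p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen

    # A point is usable only if it lies on at least one complete path.
    return sorted(reachable((1, 1), +1) & reachable(end, -1))


points = possible_grid_points(6, 6)
```

For two frames of six formants each, the points (2,3) and (5,5) used by the path of FIG. 18 are in the resulting set, while a point such as (1,2) is not, because no allowed shift leads to it from (1,1).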

Hereinafter, the steps of the path search will be described with
reference to FIG. 21.

<Step S1>

First, the numbers of formants in Frame A and Frame B are
set at NA and NB, respectively. An array C having a size of NA
× NB and arrays ni and nj both having a size of (NA+1) × (NB+1)
are prepared, and then the elements of the arrays are all initialized
to 0. C(i,j), which is an element of C, is used for storing the
cumulative cost at the point (i,j). Also, ni(i,j), which is an
element of ni, and nj(i,j), which is an element of nj, are used
for storing the path which reaches the point (i,j) at the minimum
cumulative cost, i.e., the optimum path to the point (i,j). In other
words, when the point immediately before the point (i,j) on the
optimum path to the point (i,j) is a point (m,n), ni(i,j)=m and
nj(i,j)=n hold.

<Step S2>

The cumulative costs and optimum paths for all possible grid
points are calculated (see FIG. 20).

Both a counter i and a counter j are initialized to 1. i
and j are used as the respective indexes of Frame A and Frame B.

<Step S3>

Cost calculation is made for the four possible points (m,n) from
which a shift to the point (i,j) is allowed (see FIG. 19).

A counter m and a counter n are prepared and initialized
to m=i-2 and n=j-2, respectively. Also, Cmin is prepared for
calculating the minimum cumulative cost and is initialized in
advance to as large a value as possible.

<Step S4> 

If the point (m,n) is not contained in the set of possible
grid points shown in FIG. 20, the process proceeds to Step S8.
If it is contained, the process proceeds to Step S5.

<Step S5>

Ctemp is prepared for temporarily storing a cumulative cost,
and stores the sum of the path cost of the shift from the point
(m,n) to the point (i,j) and the cumulative cost at the point (m,n).

<Step S6>

If Ctemp is smaller than Cmin (Yes), the process proceeds
to Step S7. If not (No), the process proceeds to Step S8.

<Step S7> 

Cmin is replaced with Ctemp, and m is stored in ni(i,j) and
n in nj(i,j). ni(i,j) stores the Frame A coordinate of the point
from which the point (i,j) is reached at the minimum cumulative
cost, and nj(i,j) stores the Frame B coordinate of the same point.

<Step S8>

If n=j-1 holds (Yes), the process proceeds to Step S10.
If not (No), the process proceeds to Step S9.

<Step S9>

n is incremented by 1 and then the process returns to Step
S4.

<Step S10>

If m=i-1 holds (Yes), the process proceeds to Step S12.
If not (No), the process proceeds to Step S11.

<Step S11>

n is set at j-2 again, m is incremented by 1 and then the
process returns to Step S4.

<Step S12> 

If i has reached NA+1 (Yes), the process ends. If not (No),
the process proceeds to Step S13.

<Step S13>

The cumulative cost is stored in C(i,j). Specifically,
the sum of the formant distance at the point
(i,j) (the value obtained according to Equation (26)) and Cmin is
stored therein. Note that since the point (1,1) is the starting
point of the path, no path cost exists and thus only its formant
distance is stored.

<Step S14>

If j has reached NB (Yes), the process proceeds to Step
S16. If not (No), the process proceeds to Step S15.

<Step S15>

j is incremented by 1 and then the process returns to Step
S3.

<Step S16>

If i has reached NA (Yes), the process proceeds to Step
S18. If not (No), the process proceeds to Step S17.

<Step S17>

j is set at 1 again, i is incremented by 1 and then the process
returns to Step S3.

<Step S18> 

Lastly, the point from which the end point (NA+1, NB+1) is
reached at the minimum cumulative cost is calculated.

i=NA+1 and j=NB+1 are set and then the process returns to
Step S3.

The path cost is calculated in the following manner. The
number of allowed paths is four: A, B, C and D shown in FIG. 19.
If the i-th formant in Frame A is expressed by FA(i) and the j-th
formant in Frame B is expressed by FB(j), then, as for Path A,
FA(i-1) is connected with FB(j-1) and FA(i) is connected with FB(j),
and no disconnected formant exists. Therefore, the path cost (in
other words, the disconnection cost) becomes 0. As for Path B,
FA(i-1) does not have a counterpart to be connected with. In such
a case, the path cost is calculated by substituting the intensity
of FA(i-1) in Equation (27). As for Path C, in contrast, FB(j-1)
does not have a counterpart to be connected with. Thus, the path
cost is calculated by substituting the intensity of FB(j-1) in
Equation (27). As for Path D, neither FA(i-1) nor FB(j-1) has a
counterpart to be connected with. Then, the path cost is the
sum of the value obtained by substituting the intensity of FA(i-1)
in Equation (27) and the value obtained by substituting the
intensity of FB(j-1) in Equation (27).

It will be described how an actual cost is obtained using
the calculations described above.

FIG. 22 illustrates the point (i,j) and the four points (i-1,j-1),
(i-2,j-1), (i-1,j-2) and (i-2,j-2) from which a shift to the
point (i,j) is allowed. The arrows represent shifts from the four
points to the point (i,j), and the path names A, B, C and D, which
have been defined in FIG. 19, are indicated at the respective point
ends of the arrows. Also, in the circles which represent the four
points, the respective cumulative costs at those points are indicated.

The numerals framed by squares, each located at about the
middle of the arrow which represents a path, indicate path costs.
For example, the path cost of Path B is calculated according to
Equation (27) using the intensity of F3 in Frame A, which has lost
its counterpart to be connected due to the shift, and the calculation
result is 11.

The respective cumulative costs (Ctemp, which is calculated
in Step S5) taken when the point (i,j) is reached from the four
points through the corresponding four paths are indicated around the
respective end points of the arrows. Specifically, each cumulative
cost is a value obtained by adding the path cost of the shift to the
cumulative cost at the point from which the shift originates.

As a result, the cumulative costs 4035, 483, 5351 and 1179
are obtained for Paths A, B, C and D, respectively, and Path B,
which has the smallest cumulative cost, is selected (Step S7). FIG.
23 illustrates how Path B has been selected. As Path B has been
selected, the i coordinate of the starting point of Path B is stored
in ni(i,j) and the j coordinate thereof is stored in nj(i,j). Also,
at the point (i,j), 665 is indicated, which is the cumulative cost
obtained by adding 182, the formant distance at the point (i,j)
calculated from Equation (26), to the
cumulative cost based on Path B (Step S13).
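
The arithmetic of this example can be retraced in a few lines. The numbers are the ones quoted above; the variable names are hypothetical, and the cumulative cost at the origin of Path B is not stated in the text but follows from subtracting the path cost from Ctemp.

```python
# Ctemp for each allowed path into the point (i, j), as quoted in the text.
ctemp = {"A": 4035, "B": 483, "C": 5351, "D": 1179}

# Steps S6 and S7: keep the path with the smallest cumulative cost.
best_path = min(ctemp, key=ctemp.get)
cmin = ctemp[best_path]

# Step S13: add the formant distance at (i, j) obtained from Equation (26).
formant_distance = 182
cumulative_cost = cmin + formant_distance

# Ctemp = cumulative cost at the origin + path cost, so the origin of Path B
# must have carried 483 - 11.
path_cost_b = 11
origin_cost_b = ctemp["B"] - path_cost_b
```

Here `best_path` is "B" and `cumulative_cost` is 665, matching the value indicated at the point (i,j).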

In this manner, partial optimum paths are consecutively
obtained through respective cost calculations for every grid point
on the way from (1,1) to (NA+1, NB+1). Thereafter, the aggregate
optimum path from (1,1) to (NA+1, NB+1) can be obtained by tracing
ni and nj from the end point to the starting point. The optimum
path which has been obtained is indicated in FIG. 24. Also, it
is illustrated in FIG. 25 how the formants shown in FIG. 15 are
connected as a result of the path search. As for formants which are
connected with each other, like F1 in Frame A and F1 in Frame B,
the formant filters are smoothly changed with time. Since F2 in
Frame A has no counterpart to be connected, the center frequency
of its formant filter is not changed but the intensity is gradually
changed to 0, so that the formant smoothly disappears. In contrast,
as for F2 in Frame B, the intensity is gradually increased from 0,
so that the formant smoothly appears.
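
The path search and the trace-back can be sketched compactly as follows. This is an illustrative sketch, not the flowchart of FIG. 21 itself: it assumes the cost forms given in claim 2 for Equations (26) and (27), takes formants as (frequency, intensity) pairs with the intensities already computed, and uses hypothetical names (`match_formants`, `alpha`, `beta`, `eps`) and hypothetical formant values.

```python
import math


def connection_cost(fa, fb, alpha, beta):
    # Assumed form of Equation (26): weighted frequency and intensity distance.
    return alpha * abs(fa[0] - fb[0]) + beta * abs(fa[1] - fb[1])


def disconnection_cost(f, beta, eps):
    # Assumed form of Equation (27): distance of the intensity from eps.
    return beta * abs(f[1] - eps)


def match_formants(frame_a, frame_b, alpha=1.0, beta=1.0, eps=0.0):
    """Return (connected 1-based index pairs, total cost)."""
    na, nb = len(frame_a), len(frame_b)
    cost = [[math.inf] * (nb + 2) for _ in range(na + 2)]
    prev = [[None] * (nb + 2) for _ in range(na + 2)]
    for i in range(1, na + 2):
        for j in range(1, nb + 2):
            if (i == na + 1) != (j == nb + 1):
                continue  # only the corner (NA+1, NB+1) is a terminal point
            at_end = i == na + 1
            local = 0.0 if at_end else connection_cost(
                frame_a[i - 1], frame_b[j - 1], alpha, beta)
            if (i, j) == (1, 1):
                cost[i][j] = local  # starting point: no path cost
                continue
            best, best_prev = math.inf, None
            for di in (1, 2):
                for dj in (1, 2):  # shifts A, C, B, D
                    m, n = i - di, j - dj
                    if m < 1 or n < 1 or cost[m][n] == math.inf:
                        continue
                    skip = 0.0
                    if di == 2:  # FA(i-1) is left without a counterpart
                        skip += disconnection_cost(frame_a[i - 2], beta, eps)
                    if dj == 2:  # FB(j-1) is left without a counterpart
                        skip += disconnection_cost(frame_b[j - 2], beta, eps)
                    if cost[m][n] + skip < best:
                        best, best_prev = cost[m][n] + skip, (m, n)
            if best_prev is not None:
                cost[i][j] = best + local
                prev[i][j] = best_prev
    # Trace back from the end point to the starting point.
    path, node = [], (na + 1, nb + 1)
    while node is not None:
        path.append(node)
        node = prev[node[0]][node[1]]
    path.reverse()
    pairs = [(i, j) for i, j in path if i <= na and j <= nb]
    return pairs, cost[na + 1][nb + 1]


frame_a = [(500, 50), (1500, 40), (2500, 30)]
frame_b = [(500, 50), (1400, 2), (1550, 40), (2500, 30)]
pairs, total = match_formants(frame_a, frame_b)
```

In this hypothetical input, the weak second formant of the second frame is left unconnected, so `pairs` is `[(1, 1), (2, 3), (3, 4)]`. If the two frames differ too much in their number of formants, the end point may be unreachable under the shift constraints, in which case the returned cost is infinite.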

In order to change the intensity smoothly, Fi is changed
at a constant rate. By solving Equation (28) for Fb, the following
equation is obtained.



This equation may be used to transform Fi to Fb to calculate the 
filter coefficients. 

As has been described, in the speech synthesis system
according to the fifth embodiment, DP matching is used to carry
out the optimum formant connection and thereby a disappearing
formant and an appearing formant can be properly expressed.

EMBODIMENT 6 

As has been described in the fifth embodiment, some formants
are caused to disappear or appear, which requires the formant
filters to be re-allocated in each frame. FIG. 26 shows Frame A and
Frame B shown in FIG. 25 and the frames around them. For the sake of
simplicity, only F1 through F3 and their vicinity are shown. The four
successive frames shown in FIG. 26 include the same Frame A and Frame
B as shown in FIG. 25. The frames which have Frame A and Frame
B therebetween are indicated as Frame AA and Frame BB. Between
Frame A and Frame B, neither the F2s nor the F3s are connected
according to the method described in the fifth embodiment. The
disconnections are expressed by Xs in FIG. 26. It is understood
that a disconnected formant either disappears toward a formant
with the same frequency and a very low intensity or appears from
such a formant.

In order to embody the above concept, formants having no
counterpart to be connected are connected with formants having
an infinitely large bandwidth (i.e., an intensity of 0) as shown
in FIG. 27. Black circles in FIG. 27 indicate the formants with
an infinitely large bandwidth. By doing so, the filters can be
smoothly changed while the frequencies and bandwidths of formants are
interpolated between Frame A and Frame B, and thereby a desired
spectrum can be realized.
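
The interpolation toward a zero-intensity formant can be sketched as follows. The sketch assumes linear interpolation of (frequency, intensity) pairs, in line with the statement that the intensity is changed at a constant rate; the function names and the numeric values are hypothetical, and the conversion from intensity back to bandwidth is left out because it depends on Equation (28).

```python
def interpolate_formant(start, end, steps):
    """Linearly interpolate (frequency, intensity) from start to end inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    result = []
    for s in range(steps):
        t = s / (steps - 1)  # 0.0 at the first frame, 1.0 at the second
        freq = start[0] + (end[0] - start[0]) * t
        intensity = start[1] + (end[1] - start[1]) * t
        result.append((freq, intensity))
    return result


def fade_out(formant, steps):
    """Connect a disappearing formant with a same-frequency formant of intensity 0."""
    return interpolate_formant(formant, (formant[0], 0.0), steps)


def fade_in(formant, steps):
    """Let an appearing formant grow from a same-frequency formant of intensity 0."""
    return interpolate_formant((formant[0], 0.0), formant, steps)


trajectory = fade_out((1500.0, 40.0), 5)
```

The center frequency stays fixed along `trajectory` while the intensity falls to 0 in equal steps, which corresponds to a formant connected with one of the black circles in FIG. 27.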

However, since Frame AA and Frame A differ from each other in the
number of formants, a smooth filter change
therebetween cannot be realized by a simple interpolation. Frame
AA and Frame BB are each implementable by a cascade connection of
three filters as shown in FIG. 28A. In FIG. 28, the formant filters
are represented by FF1, FF2 and the like from the left. As for
Frame A and Frame B, however, five filters have to be connected
in cascade. Supposing that the F1s are not connected with each other,
six filters at most are connected in cascade. FIG. 28B illustrates
the state of a cascade connection of six filters.

Here, for the sake of simplicity, second-order (quadratic) mono-pole
filters are used as the formant filters. In the upper part of FIG. 28,
the inside of one of the filters is shown on an enlarged scale.
D1 and D2 are delay elements, each of which stores a single-step
value. The transfer function is as follows:

a( z )_ J 

F1 in Frame AA is connected straight to F1 in Frame A,
but F2 in Frame AA is connected with F3 in Frame A. Therefore,
in this case, the allocation of the filters must be taken into account.
Thus, the six filters are kept connected in cascade at all times
and the following steps are carried out during the period from
Frame AA to Frame A.

(1) In Frame AA, since only three filters are needed, D1 and
    D2 are cleared to 0 at the filters FF4 through FF6, and
    a=1, b=0 and c=0 are set. A state equivalent to the one
    where the filters are bypassed can thereby be
    achieved. At FF1, FF2 and FF3, a, b and c are calculated
    from the respective frequencies and bandwidths of F1,
    F2 and F3.

(2) Between Frame AA and Frame A, the frequencies and the
    bandwidths are consecutively calculated according to the
    respective paths of connected formants and thereby the filter
    properties are smoothly changed.

(3) At the point of Frame A, the allocation of formant filters
    is modified. FF1 in the previous frame is allocated to
    F1 in Frame A. Meanwhile, FF2 is allocated to F2 in Frame
    A. However, F2 in Frame AA is shifted to F3 at the point
    of Frame A. In Frame A, if FF2 were simply allocated to F2,
    the filter coefficients would change abruptly and click noise
    would therefore be generated. Thus, a, b and c, which are the
    coefficients of FF2 in the previous frame, and the values of D1
    and D2, which represent its internal state, are copied into
    FF3, and FF2 is allocated to F2, which has newly appeared.

The operation shown above will be described more specifically
with reference to FIG. 29.

FIG. 29 shows changes in the configuration of the formant filters
in Frame AA, Frame A, Frame B and Frame BB. In each cell for the
formant filters, three numbers are indicated. The three numbers
represent the formant frequency and the formant bandwidth of a
formant filter, and the number (connection number) of the counterpart
in the previous frame which has been connected with the formant
filter, respectively.

For example, the connection number of FF1 in Frame A is 1.
This means that FF1 in Frame AA has been connected straight to
FF1 in Frame A. However, the connection number of FF3 in Frame
A is not 3 but 2. This means that FF2 in Frame AA has been connected
with FF3 in Frame A. Also, the connection number of FF2 in Frame
A is 0, which indicates that no filter in Frame AA to be connected
with FF2 in Frame A exists and therefore that FF2 is a formant
which has newly appeared in Frame A. In Frame BB, no formant having
the connection number 3 exists. This means that no counterpart
to be connected with F3 in Frame B exists in Frame BB and that
F3 in Frame B has disappeared. The formant filters in which all the
three numerical values are 0 are ones that do not need to function
as filters and will be bypassed, that is, the coefficients of
the filter are a=1, b=0 and c=0.

At the time when the state shifts from Frame AA to Frame
A, the filters are re-allocated in accordance with the steps shown
in FIG. 30.

Repeat from FF6 toward FF1 in order (Step S31 through Step S39):

    if the connection number is 0 (Step S32)
        clear D1 and D2 (Step S33).
    else
        assuming that the connection number is N, copy
        D1 and D2 of the Nth formant filter FFN.
    endif
    calculate a, b and c from the formant frequency and
    bandwidth and set the resultant a, b and c (Step S36).
    Note that when the formant frequency and bandwidth are
    both 0, a=1, b=0 and c=0 are set (Step S37).

Finish repeating the steps.
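
The re-allocation steps can be sketched as follows. This is a sketch under stated assumptions rather than the procedure of FIG. 30 itself: the class `FormantFilter`, the function `reallocate` and the callback `coeffs` are hypothetical, and the mapping from frequency and bandwidth to a, b and c is passed in as a callback because the transfer function is not reproduced here. Instead of overwriting the filters in place from FF6 toward FF1, the sketch reads all internal states from the previous frame's filters and builds a new list, so the result does not depend on the processing order.

```python
from dataclasses import dataclass


@dataclass
class FormantFilter:
    a: float = 1.0   # a=1, b=0, c=0 corresponds to a bypassed filter
    b: float = 0.0
    c: float = 0.0
    d1: float = 0.0  # internal state held by delay element D1
    d2: float = 0.0  # internal state held by delay element D2


def reallocate(prev_filters, targets, coeffs):
    """Build the filter bank for the next frame.

    targets holds one (frequency, bandwidth, connection_number) triple per
    filter slot; connection_number is the 1-based index of the connected
    filter in the previous frame, or 0 for a newly appearing formant.
    """
    new_filters = []
    for freq, bandwidth, conn in targets:
        if conn == 0:
            d1, d2 = 0.0, 0.0            # clear D1 and D2
        else:
            source = prev_filters[conn - 1]
            d1, d2 = source.d1, source.d2  # carry the state over
        if freq == 0 and bandwidth == 0:
            a, b, c = 1.0, 0.0, 0.0      # unused slot: bypass
        else:
            a, b, c = coeffs(freq, bandwidth)
        new_filters.append(FormantFilter(a, b, c, d1, d2))
    return new_filters


# Hypothetical example: FF2 of the previous frame moves to slot FF3 while a
# new formant appears in slot FF2.
previous = [FormantFilter(d1=0.1, d2=0.2), FormantFilter(d1=0.3, d2=0.4),
            FormantFilter(d1=0.5, d2=0.6)] + [FormantFilter() for _ in range(3)]
targets = [(500, 60, 1), (1400, 900, 0), (1550, 80, 2),
           (2500, 120, 3), (0, 0, 0), (0, 0, 0)]
stub_coeffs = lambda freq, bandwidth: (0.5, 0.25, -0.125)  # placeholder values
current = reallocate(previous, targets, stub_coeffs)
```

Carrying D1 and D2 along with the formant they belong to is what avoids the abrupt change of filter state, and hence the click noise, described above.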

As has been described above, since the speech synthesis
system according to the sixth embodiment has a mechanism for
modifying the configuration of a filter cascade connection
according to the result of an optimum formant connection by DP
matching, it is possible to smoothly reproduce a spectrum according
to formants which have been optimally connected by DP matching,
prevent generation of click noise and discontinuity of a waveform
and therefore synthesize a smooth speech.

WHAT IS CLAIMED IS:

1. A speech synthesis system, which synthesizes speech using
time series data of formant parameters (including a formant
frequency and a formant bandwidth) estimated based on a speech
production model, the speech synthesis system comprising
determining the correspondence of formant parameters between
adjacent frames using dynamic programming.

2. The speech synthesis system of Claim 1, wherein in
determining the correspondence of the formant parameters, a
connection cost $d_c(F(n), F(n+1))$ and a disconnection cost $d_d(F(k))$
are obtained using the equations:

$$d_c(F(n), F(n+1)) = \alpha\,|F_f(n) - F_f(n+1)| + \beta\,|F_i(n) - F_i(n+1)|$$

$$d_d(F(k)) = \alpha\,|F_f(k) - F_f(k)| + \beta\,|F_i(k) - \varepsilon| = \beta\,|F_i(k) - \varepsilon|$$

where $\alpha$ and $\beta$ are predetermined weight coefficients, $F_f(n)$ is a
formant frequency in the n-th frame, $F_i(n)$ is a formant intensity
in the n-th frame and $\varepsilon$ is a predetermined value, and the resultant
$d_c(F(n), F(n+1))$ and $d_d(F(k))$ are used as costs for grid point shifting
in dynamic programming.

3. The speech synthesis system of Claim 2, wherein for two
adjacent frames in which there exists a formant which has no
counterpart to be connected,

a formant having the same frequency as that of the disconnected
formant in one of the frames and an intensity of 0 is located in
the other frame and

the two adjacent frames are connected by interpolation of
frequencies and intensities of both the formants according to a
smooth function.

4. The speech synthesis system of Claim 2, wherein the formant
intensity $F_i(n)$ is calculated using



where $F_b(n)$ is a formant bandwidth in the n-th frame and $F_s$ is a
sampling frequency.

5. The speech synthesis system of Claim 3, wherein a vocal
tract transfer function including a plurality of formants is
implemented by a cascade connection of a plurality of filters and

wherein when a formant which has no counterpart to be connected
exists in the adjacent frames and thus the connection of the filters
needs to be changed,

a coefficient and internally stored data of the filter
in question are copied into another filter and

the first filter is then overwritten with a coefficient and
internally stored data of still another filter or initialized
to predetermined values.

6. The speech synthesis system of Claim 4, wherein a vocal
tract transfer function including a plurality of formants is
implemented by a cascade connection of a plurality of filters and

wherein when a formant which has no counterpart to be connected
exists in the adjacent frames and thus the connection of the filters
needs to be changed,

a coefficient and internally stored data of the filter
in question are copied into another filter and

the first filter is then overwritten with a coefficient and
internally stored data of still another filter or initialized
to predetermined values.

7. A speech analysis method, in which a sound source parameter
and a vocal tract parameter of a speech signal waveform are estimated
by using a glottal source model including an RK voicing source
model, the speech analysis method comprising the steps of:

extracting an estimated voicing source waveform using a
filter which is constituted by the inverse characteristic of an
estimated vocal tract transfer function;

estimating a peak position corresponding to a GCI (glottal
closure instant) of the estimated voicing source waveform in
time steps finer than the sampling
period by applying a quadratic function;

synchronizing the GCI with a sampling position in the vicinity
of the estimated peak position and thereby generating a voicing
source model waveform; and

time-shifting the generated voicing source model waveform
in time steps finer than the
sampling period by means of all-pass filters and thereby matching
the GCI with the estimated peak position.

8. A speech analysis method, in which a voicing source
parameter and a vocal tract parameter of a speech signal waveform
are estimated by using a glottal voicing source model such as an
RK model or a model defined as a modified model thereof, the speech
analysis method comprising the steps of:

extracting an estimated voicing source waveform using
filters which are constituted by the inverse characteristic of
an estimated vocal tract transfer function; and

assuming the first harmonic level as H1 and the second
harmonic level as H2 in DFT (discrete Fourier transformation) of
the estimated voicing source waveform and estimating an OQ (open
quotient) from a value for HD defined as HD=H2-H1.

9. The speech analysis method of Claim 8, wherein for
estimating the OQ, the relation:

$$OQ = 3.65\,HD - 0.213\,HD^2 + 0.0224\,HD^3 + 50.7$$

is used.

ABSTRACT OF THE DISCLOSURE

A speech segment to be analyzed is cut out with a window
having a length of a plurality of pitch periods for RK model voicing
source parameter estimation. GCIs are all estimated for a
plurality of voicing source pulses. Based on such estimations,
an RK model voicing source waveform is generated, its relationship
with the speech segment is analyzed by ARX system identification,
and then a glottal transform function is estimated. This
process is repeated, and when the GCIs converge at a predetermined
value, the identification is completed. Accordingly, a high quality
analysis-synthesis system, which isolates voicing source
parameters of speech signals from vocal tract parameters thereof
with high accuracy, can be realized.